library(readr)
library(lmtest)
library(faraway)
Conventional wisdom is that money buys happiness (winning) in Major League Baseball. However, the advent of “Moneyball” in the early 2000s by the Oakland Athletics, Cleveland Indians, and other teams, has lead to a more analytical approach to determining the make-up of Major League rosters.
As it turns out, money is not the magic elixir when it comes to assembling a winning Major League Baseball (MLB) team. The following plot shows that salary does not highly correlate with a winning record. This is substantiated by the companion single linear regression model and summary statistics that show that salary, while significant, only has a marginal impact on wins by an MLB team. The adjusted \(R^2\) from the simple linear regression – using training data from 2000 through 2013 – is low. Furthermore, the average percent error that compares actual wins versus predicted wins from the the test data (2014 through 2016) is high.
##
## Call:
## lm(formula = W ~ salary, data = bbproj_trn)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34.389 -8.180 0.701 8.404 35.659
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 71.70962155199 1.29065984697 55.560 < 2e-16 ***
## salary 0.00000011551 0.00000001468 7.866 0.0000000000000316 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.86 on 418 degrees of freedom
## Multiple R-squared: 0.1289, Adjusted R-squared: 0.1269
## F-statistic: 61.87 on 1 and 418 DF, p-value: 0.00000000000003158
## [1] "Leave One Out Cross-Validated RMSE: 10.88"
So what are the factors that move the needle?
We explore two related threads in attempting to identify the factors that have a strong impact on a winning baseball team.
The first thread uses multiple linear regression to identify the factors that influence a team’s winning record over the course of a regular Major League Baseball (MLB) season.
The second thread uses logistic regression to identify the factors that influence a team’s ability to win their division.
A few words are in order about where we obtained the data from to perform this analysis.
The source of data is the Sean Lahman baseball archive (http://www.seanlahman.com/baseball-archive/), recognized by the Society for American Baseball Research (SABR) as the leading detailed player and team data archive from 1874 through the end of the 2017 Major League Baseball season. We use team statistics from the years 2000 through 2013 as the training dataset and use data from 2014 through 2016 as the test dataset. (As a note, we were unable to include the 2017 season in our analysis due to unavailability of salary data.)
The team-based variables we examine fall into one of three categories: offensive, defensive, or administrative.
yearID - YearfranchID - Franchise, or team nameW - Wins (multiple linear regression response variable)DivWin - Division winner (Y or N factor- Logistic linear regression response variable)WCWin - Wild Card winner (Y or N factor)LgWin - League champion (Y or N factor)WSWin - World Series winner (Y or N factor)salary - Team Salary (U.S dollars not adjusted for inflation)R - Runs scoredAB - At batsH - Total hits (including doubles, triples, and home runs)X1B - SinglesX2B - DoublesX3B - TriplesHR - Home runsBB - Base on balls (walks)SO - StrikeoutsSB - Stolen basesCS - Caught stealingHBP - Hit by pitchSF - Sacrifice fliespark - Name of team’s home ballpark (factor)attendance - Home attendance totalBPF - Three-year park factor for battersGIDP - Grounded into double playsRBI - Runs Batted InIBB - Intentional walksTB - Total basesSLG - Slugging percentageOBP - On-base percentageOPS - On-base plus slugging percentageBABIP - Batting average for balls in playRC - Runs createduBB - Unintentional walkswOBA - Weighted on-base averageRA - Total runs allowedER - Earned runs allowedERA - Earned run averageCG - Complete gamesSHO - ShutoutsSV - SavesIPOuts - Outs pitchedHA - Hits allowedHRA - Home runs allowedBBA - Walks allowedSOA - Strikeouts by pitchersE - ErrorsDP - Double playsFP - Fielding percentagePPF - Three-year park factor for pitchersWHIP - Walks and hits per innings pitchedThe goal of this thread is to find an Multiple Linear Regression (MLR) model that is simple enough to explain the relationship between the response variable (regular season wins) and the predictors. This means we are interested a reasonably small MLR model that is easy to interpret. Equally important is the need for the model to predict well against the test data set from the 2014, 2015 and 2016 MLB regular seasons. This means we have to have sacrifice a certain degree of model simplicity. In short, we are trying to find a model that will allow us to “have our cake and eat it too”.
The approach we take is to employ all three seach procedures – Backward, Forward and Step – against the Akaike Information Criterion (AIC) and the Baysesian Information Criterion (BIC) quality criterion in order to find the model that best meets the goals of this thread.
We start by creating the following six initial models.
We then use an iterative process of model evaluation and refinement until we have produced a line-up of candidate models that move on to the next phase of assessment: prediction against the test dataset.
The model evaluation and refinement process uses the following six diagnostics tests.
If a model passes all six of these diagnostic tests then it is considered worthy of being evaluated against the test data set. From a nomenclature perspective, each model that passes this hurdle is called candidate model. If a model does not pass all six of these diagnostic tests we refit the model after removing a single predictor. We then perform the same six diagnostic tests. We repeat this cycle until a model is produced that passes all six of the diagnostic tests.
Once candidate models are identified, we assess their ability to predict regular season team wins against the test dataset. The candidate model with the lowest Average Percent Error and the lowest Test RMSE is declared the winner.
Please note that the response variable is designated as capital W in models that are described below.
scope <- W ~ R + AB + H + X2B + X3B + HR + BB + SO + SB + CS + HBP + SF + RA +
ER + ERA + CG + SHO + SV + IPouts + HA + HRA + BBA + SOA + E + DP + FP +
attendance + BPF + PPF + salary + RBI + GIDP + IBB + TB + SLG + OBP + OPS +
WHIP + BABIP + RC + X1B + uBB + wOBA
formula <- formula(scope)
start_model <- lm(formula, data= bbproj_trn)
n <- length(resid(start_model))
step_search_start_model <- lm(W ~ 1, data= bbproj_trn)
### Backwards Search: BIC Model
bic_model <- step(start_model, direction = "backward", k = log(n), trace = 0)
summary(bic_model)
##
## Call:
## lm(formula = W ~ R + AB + H + X2B + BB + SO + CS + SF + RA +
## CG + SHO + SV + IPouts + HRA + BBA + E + FP + GIDP + BABIP +
## RC, data = bbproj_trn)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.8573 -1.9757 -0.1555 1.9118 7.4927
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1297.971349 339.322389 3.825 0.000152 ***
## R 0.076579 0.006140 12.472 < 2e-16 ***
## AB -0.116786 0.014360 -8.133 5.37e-15 ***
## H 0.306262 0.049740 6.157 1.81e-09 ***
## X2B 0.034285 0.010085 3.399 0.000743 ***
## BB 0.031072 0.006669 4.659 4.33e-06 ***
## SO 0.051706 0.009680 5.341 1.55e-07 ***
## CS -0.043025 0.015571 -2.763 0.005990 **
## SF -0.095773 0.021708 -4.412 1.32e-05 ***
## RA -0.054254 0.003811 -14.238 < 2e-16 ***
## CG 0.148895 0.046962 3.171 0.001639 **
## SHO 0.150078 0.049080 3.058 0.002380 **
## SV 0.345616 0.025203 13.714 < 2e-16 ***
## IPouts 0.057453 0.007126 8.063 8.78e-15 ***
## HRA -0.026131 0.008530 -3.063 0.002338 **
## BBA -0.008042 0.002619 -3.070 0.002287 **
## E -0.182853 0.055690 -3.283 0.001116 **
## FP -1045.006972 343.178691 -3.045 0.002480 **
## GIDP -0.044072 0.012147 -3.628 0.000322 ***
## BABIP -760.159504 137.620611 -5.524 6.00e-08 ***
## RC -0.115957 0.024057 -4.820 2.04e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.729 on 399 degrees of freedom
## Multiple R-squared: 0.9475, Adjusted R-squared: 0.9449
## F-statistic: 360.1 on 20 and 399 DF, p-value: < 2.2e-16
### Backwards Search: AIC Model
aic_model <- step(start_model, direction = "backward", trace = 0)
summary(aic_model)
##
## Call:
## lm(formula = W ~ R + AB + H + X2B + BB + SO + CS + SF + RA +
## CG + SHO + SV + IPouts + HRA + BBA + E + FP + BPF + PPF +
## salary + GIDP + BABIP + RC, data = bbproj_trn)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9901 -1.9391 -0.1523 1.9737 7.8582
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1224.786161854707 345.895721393242 3.541 0.000446 ***
## R 0.080154881164 0.006191576342 12.946 < 2e-16 ***
## AB -0.115370180878 0.014302511665 -8.066 8.71e-15 ***
## H 0.294198894471 0.049836065902 5.903 7.66e-09 ***
## X2B 0.034434010453 0.010055404520 3.424 0.000680 ***
## BB 0.029747234634 0.006800119630 4.375 1.56e-05 ***
## SO 0.048815441474 0.009683749780 5.041 7.06e-07 ***
## CS -0.044627799778 0.015532577786 -2.873 0.004283 **
## SF -0.096303670130 0.021621972055 -4.454 1.10e-05 ***
## RA -0.057347471123 0.004031932044 -14.223 < 2e-16 ***
## CG 0.154701717514 0.046819638380 3.304 0.001039 **
## SHO 0.143225421132 0.048746440530 2.938 0.003495 **
## SV 0.346070608015 0.025076294754 13.801 < 2e-16 ***
## IPouts 0.058770788132 0.007084174725 8.296 1.71e-15 ***
## HRA -0.022515047089 0.008581255343 -2.624 0.009033 **
## BBA -0.008098302987 0.002628330947 -3.081 0.002206 **
## E -0.172337651761 0.056648920630 -3.042 0.002505 **
## FP -977.032306332758 348.883980530369 -2.800 0.005353 **
## BPF -0.639264997534 0.233137357795 -2.742 0.006383 **
## PPF 0.599162430181 0.234965627610 2.550 0.011148 *
## salary 0.000000008060 0.000000004559 1.768 0.077838 .
## GIDP -0.047142718419 0.012117297039 -3.891 0.000117 ***
## BABIP -723.253375876375 138.139293383544 -5.236 2.67e-07 ***
## RC -0.109319751926 0.024176237333 -4.522 8.11e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.704 on 396 degrees of freedom
## Multiple R-squared: 0.9489, Adjusted R-squared: 0.9459
## F-statistic: 319.4 on 23 and 396 DF, p-value: < 2.2e-16
### Step Search: BIC Model
step_bic_model <- step(step_search_start_model, scope = scope, direction = "both",
k = log(n), trace = 0)
summary(step_bic_model)
##
## Call:
## lm(formula = W ~ SV + RA + CG + OBP + X3B + SHO + R, data = bbproj_trn)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.7419 -1.9652 0.0052 1.8342 9.5000
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.123870 6.025094 6.493 0.000000000242 ***
## SV 0.427352 0.025372 16.844 < 2e-16 ***
## RA -0.076829 0.002616 -29.364 < 2e-16 ***
## CG 0.152982 0.046478 3.292 0.001082 **
## OBP 65.425764 23.940331 2.733 0.006549 **
## X3B -0.063520 0.016876 -3.764 0.000192 ***
## SHO 0.174222 0.053140 3.279 0.001132 **
## R 0.079990 0.004115 19.439 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.01 on 412 degrees of freedom
## Multiple R-squared: 0.9341, Adjusted R-squared: 0.9329
## F-statistic: 833.7 on 7 and 412 DF, p-value: < 2.2e-16
### Step Search: AIC Model ###
step_aic_model <- step(step_search_start_model, scope = scope,
direction = "both", trace = 0)
summary(step_aic_model)
##
## Call:
## lm(formula = W ~ SV + RA + CG + X3B + SHO + R + IPouts + AB +
## H + CS + GIDP + SF + BBA + HRA + E + FP + BPF + PPF + salary +
## HA, data = bbproj_trn)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.5195 -1.9115 -0.1021 1.8888 8.0195
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 914.034003144572 347.045241981607 2.634 0.008773 **
## SV 0.355614506804 0.024640390657 14.432 < 2e-16 ***
## RA -0.050247955833 0.006340159551 -7.925 0.000000000000023 ***
## CG 0.153017526099 0.044231471311 3.459 0.000600 ***
## X3B -0.068574679286 0.016831959357 -4.074 0.000055748930342 ***
## SHO 0.138199572874 0.049132489791 2.813 0.005154 **
## R 0.084827624066 0.003584776861 23.663 < 2e-16 ***
## IPouts 0.061359854959 0.006880713970 8.918 < 2e-16 ***
## AB -0.052857166713 0.005932828505 -8.909 < 2e-16 ***
## H 0.049033080922 0.006741047553 7.274 0.000000000001865 ***
## CS -0.046638144061 0.015078914455 -3.093 0.002121 **
## GIDP -0.043755357335 0.011669100346 -3.750 0.000203 ***
## SF -0.052733758025 0.019580831875 -2.693 0.007377 **
## BBA -0.010442337072 0.003185502948 -3.278 0.001137 **
## HRA -0.021622522454 0.008707631554 -2.483 0.013432 *
## E -0.161327447529 0.057050198834 -2.828 0.004923 **
## FP -884.643492393877 352.491658585801 -2.510 0.012479 *
## BPF -0.629501587254 0.230983379839 -2.725 0.006707 **
## PPF 0.584792736874 0.231600448463 2.525 0.011956 *
## salary 0.000000007730 0.000000004537 1.704 0.089183 .
## HA -0.005934333997 0.004089404824 -1.451 0.147524
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.727 on 399 degrees of freedom
## Multiple R-squared: 0.9476, Adjusted R-squared: 0.945
## F-statistic: 360.7 on 20 and 399 DF, p-value: < 2.2e-16
bic_forward_model <- step(step_search_start_model, scope = scope, k = log(n),trace = 0)
summary(bic_forward_model)
##
## Call:
## lm(formula = W ~ SV + RA + CG + OBP + X3B + SHO + R, data = bbproj_trn)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.7419 -1.9652 0.0052 1.8342 9.5000
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.123870 6.025094 6.493 0.000000000242 ***
## SV 0.427352 0.025372 16.844 < 2e-16 ***
## RA -0.076829 0.002616 -29.364 < 2e-16 ***
## CG 0.152982 0.046478 3.292 0.001082 **
## OBP 65.425764 23.940331 2.733 0.006549 **
## X3B -0.063520 0.016876 -3.764 0.000192 ***
## SHO 0.174222 0.053140 3.279 0.001132 **
## R 0.079990 0.004115 19.439 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.01 on 412 degrees of freedom
## Multiple R-squared: 0.9341, Adjusted R-squared: 0.9329
## F-statistic: 833.7 on 7 and 412 DF, p-value: < 2.2e-16
aic_forward_model <- step(step_search_start_model, scope = scope, trace = 0)
summary(aic_forward_model)
##
## Call:
## lm(formula = W ~ SV + RA + CG + X3B + SHO + R + IPouts + AB +
## H + CS + GIDP + SF + BBA + HRA + E + FP + BPF + PPF + salary +
## HA, data = bbproj_trn)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.5195 -1.9115 -0.1021 1.8888 8.0195
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 914.034003144572 347.045241981607 2.634 0.008773 **
## SV 0.355614506804 0.024640390657 14.432 < 2e-16 ***
## RA -0.050247955833 0.006340159551 -7.925 0.000000000000023 ***
## CG 0.153017526099 0.044231471311 3.459 0.000600 ***
## X3B -0.068574679286 0.016831959357 -4.074 0.000055748930342 ***
## SHO 0.138199572874 0.049132489791 2.813 0.005154 **
## R 0.084827624066 0.003584776861 23.663 < 2e-16 ***
## IPouts 0.061359854959 0.006880713970 8.918 < 2e-16 ***
## AB -0.052857166713 0.005932828505 -8.909 < 2e-16 ***
## H 0.049033080922 0.006741047553 7.274 0.000000000001865 ***
## CS -0.046638144061 0.015078914455 -3.093 0.002121 **
## GIDP -0.043755357335 0.011669100346 -3.750 0.000203 ***
## SF -0.052733758025 0.019580831875 -2.693 0.007377 **
## BBA -0.010442337072 0.003185502948 -3.278 0.001137 **
## HRA -0.021622522454 0.008707631554 -2.483 0.013432 *
## E -0.161327447529 0.057050198834 -2.828 0.004923 **
## FP -884.643492393877 352.491658585801 -2.510 0.012479 *
## BPF -0.629501587254 0.230983379839 -2.725 0.006707 **
## PPF 0.584792736874 0.231600448463 2.525 0.011956 *
## salary 0.000000007730 0.000000004537 1.704 0.089183 .
## HA -0.005934333997 0.004089404824 -1.451 0.147524
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.727 on 399 degrees of freedom
## Multiple R-squared: 0.9476, Adjusted R-squared: 0.945
## F-statistic: 360.7 on 20 and 399 DF, p-value: < 2.2e-16
### Breusch-Pagan Test on Backwards Search - BIC Model
library(lmtest)
bptest(bic_model)
##
## studentized Breusch-Pagan test
##
## data: bic_model
## BP = 19.912, df = 20, p-value = 0.4635
plot_fitted_versus_residuals(fitted(bic_model),
resid(bic_model),
"Backwards Search - BIC Model")
### Shapiro - Wilk Normality Test on Backwards Search - BIC Model ###
shapiro.test(resid(bic_model))
##
## Shapiro-Wilk normality test
##
## data: resid(bic_model)
## W = 0.99328, p-value = 0.05825
qqnorm(resid(bic_model),
main = "Normal Q-Q Plot, Backwards Search - BIC Model",
col = "darkgrey")
qqline(resid(bic_model), col = "dodgerblue", lwd = 2)
print(paste("Leave One Out Cross-Validated RMSE: ",
round(calc_loocv_rmse(bic_model), 2)))
## [1] "Leave One Out Cross-Validated RMSE: 2.81"
### Do the number of standard residuals greater than 2 exceed 5% of the total
### observations -- Backwards Search: BIC Model
std_resid_bic_model <- rstandard(bic_model)[abs(rstandard(bic_model)) > 2]
is_std_resid_gt_five_percent_bic_model <- length(std_resid_bic_model) / n > 0.05
ifelse(is_std_resid_gt_five_percent_bic_model,
"Outliers Exceed 5% of Obs",
"Outliers Do Not Exceed 5% of Obs")
## [1] "Outliers Do Not Exceed 5% of Obs"
### VIF > 5 for Backwards Search: BIC Model Coefficients
vif_bic_model <- vif(bic_model)
vif_bic_model[which(vif_bic_model > 5)]
## R AB H BB SO RA E FP BABIP RC
## 14.894205 72.403111 929.925966 12.247576 80.989904 6.226679 48.143435 48.871767 122.560946 226.450179
library(caret)
bic_model_high_vif_cols <- c("R", "AB", "H", "BB", "SO", "RA", "E", "FP",
"BABIP", "RC")
indices_to_drop <- findCorrelation(cor(bbproj_trn[,c(bic_model_high_vif_cols)]),
cutoff = 0.6)
(vars_to_drop <- bic_model_high_vif_cols[indices_to_drop])
## [1] "H" "RC" "R" "FP"
smaller_bic_model <- lm(W ~ AB + BB + SO + SF + RA + CG + SV + IPouts + BBA + BABIP, data = bbproj_trn)
summary(smaller_bic_model)
##
## Call:
## lm(formula = W ~ AB + BB + SO + SF + RA + CG + SV + IPouts +
## BBA + BABIP, data = bbproj_trn)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.0090 -3.1465 0.0746 3.1017 13.2658
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -61.891791 33.514874 -1.847 0.065513 .
## AB 0.035857 0.005639 6.358 0.000000000546 ***
## BB 0.055611 0.003754 14.812 < 2e-16 ***
## SO -0.008264 0.002508 -3.296 0.001067 **
## SF 0.083700 0.033456 2.502 0.012746 *
## RA -0.070294 0.004609 -15.250 < 2e-16 ***
## CG 0.335688 0.080389 4.176 0.000036315586 ***
## SV 0.599497 0.041742 14.362 < 2e-16 ***
## IPouts -0.019466 0.010220 -1.905 0.057516 .
## BBA -0.008600 0.004530 -1.898 0.058341 .
## BABIP 118.264318 33.616837 3.518 0.000484 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.888 on 409 degrees of freedom
## Multiple R-squared: 0.8274, Adjusted R-squared: 0.8231
## F-statistic: 196 on 10 and 409 DF, p-value: < 2.2e-16
model_diagnostics(smaller_bic_model)
## [1] "Variance Inflation Factors"
## AB BB SO SF RA CG SV IPouts BBA BABIP
## 3.480184 1.210000 1.693955 1.472823 2.839735 1.243693 1.573034 2.843318 1.558630 2.279289
## [1] "Number of coefficients with VIF > 5: 0"
##
## studentized Breusch-Pagan test
##
## data: model
## BP = 13.488, df = 10, p-value = 0.1976
##
##
## Shapiro-Wilk normality test
##
## data: resid(model)
## W = 0.99706, p-value = 0.6567
##
## [1] "Leave One Out Cross-Validated RMSE: 4.95"
## [1] "Outliers Exceed 5% of Obs"
The Smaller Backward Search : BIC model passes all six of the diagnostic tests. It is deemed as a candidate model for the next phase: evaluation of the model’s predictive capability against the test dataset.
### Breusch-Pagan Test on AIC Model
bptest(aic_model)
##
## studentized Breusch-Pagan test
##
## data: aic_model
## BP = 22.713, df = 23, p-value = 0.4777
plot_fitted_versus_residuals(fitted(aic_model), resid(aic_model),
"Backwards Search - AIC Model")
### Shapiro - Wilk Normality Test on Backwards Search - AIC Model ###
shapiro.test(resid(aic_model))
##
## Shapiro-Wilk normality test
##
## data: resid(aic_model)
## W = 0.99355, p-value = 0.07005
qqnorm(resid(aic_model),
main = "Normal Q-Q Plot, Backwards Search - AIC Model",
col = "darkgrey")
qqline(resid(aic_model), col = "dodgerblue", lwd = 2)
## [1] "Leave One Out Cross-Validated RMSE: 2.79"
### Do the number of standard residuals greater than 2 exceed 5% of the total
### observations -- Backwards Search: AIC Model
std_resid_aic_model <- rstandard(aic_model)[abs(rstandard(aic_model)) > 2]
is_std_resid_gt_five_percent_aic_model <- length(std_resid_aic_model) / n > 0.05
ifelse(is_std_resid_gt_five_percent_aic_model,
"Outliers Exceed 5% of Obs",
"Outliers Do Not Exceed 5% of Obs")
## [1] "Outliers Do Not Exceed 5% of Obs"
### VIF > 5 for Backwards Search: AIC Model Coefficients
vif_aic_model <- vif(aic_model)
vif_aic_model[which(vif_aic_model > 5)]
## R AB H BB SO RA E FP BPF PPF BABIP RC
## 15.427663 73.154978 950.861657 12.972390 82.555959 7.100542 50.740746 51.448558 85.281044 84.114946 125.780508 232.943354
aic_model_high_vif_cols <- c("R", "AB", "H", "X2B", "X3B", "HR", "BB", "SO",
"SF", "RA", "ER", "IPouts", "E", "FP","BPF", "PPF",
"IBB", "OBP", "BABIP", "RC", "wOBA")
indices_to_drop <- findCorrelation(cor(bbproj_trn[,c(aic_model_high_vif_cols)]),
cutoff = 0.6)
(vars_to_drop <- aic_model_high_vif_cols[indices_to_drop])
## [1] "RC" "wOBA" "R" "OBP" "H" "RA" "BPF" "FP"
# smaller_aic_model <- lm(formula = W ~ AB + HR + BB + SO + CS + ER + CG +
# SHO + SV + IPouts + BBA + FP + GIDP + IBB + BABIP,
# data = bbproj)
smaller_aic_model <- lm(formula = W ~ HR + BB + SO + ER + CG + SHO + SV +
IPouts + BBA + FP + GIDP + BABIP, data = bbproj_trn)
summary(smaller_aic_model)
##
## Call:
## lm(formula = W ~ HR + BB + SO + ER + CG + SHO + SV + IPouts +
## BBA + FP + GIDP + BABIP, data = bbproj_trn)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.3371 -2.5236 -0.0841 2.1182 10.2418
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -328.428863 67.755211 -4.847 0.00000178 ***
## HR 0.132223 0.005998 22.044 < 2e-16 ***
## BB 0.032238 0.002798 11.523 < 2e-16 ***
## SO -0.023360 0.001674 -13.952 < 2e-16 ***
## ER -0.068993 0.003633 -18.993 < 2e-16 ***
## CG 0.189266 0.057990 3.264 0.001192 **
## SHO 0.235242 0.061006 3.856 0.000134 ***
## SV 0.410129 0.030171 13.593 < 2e-16 ***
## IPouts 0.014815 0.005185 2.857 0.004495 **
## BBA -0.008063 0.003203 -2.517 0.012216 *
## FP 278.765319 69.652956 4.002 0.00007457 ***
## GIDP -0.037475 0.012793 -2.929 0.003589 **
## BABIP 316.088714 16.896493 18.707 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.435 on 407 degrees of freedom
## Multiple R-squared: 0.9152, Adjusted R-squared: 0.9127
## F-statistic: 365.9 on 12 and 407 DF, p-value: < 2.2e-16
model_diagnostics(smaller_aic_model)
## [1] "Variance Inflation Factors"
## HR BB SO ER CG SHO SV IPouts BBA FP GIDP BABIP
## 1.426241 1.360620 1.529067 3.064298 1.310541 1.907533 1.664221 1.482259 1.578259 1.270639 1.409269 1.166015
## [1] "Number of coefficients with VIF > 5: 0"
##
## studentized Breusch-Pagan test
##
## data: model
## BP = 8.7708, df = 12, p-value = 0.7224
##
##
## Shapiro-Wilk normality test
##
## data: resid(model)
## W = 0.99241, p-value = 0.03137
##
## [1] "Leave One Out Cross-Validated RMSE: 3.49"
## [1] "Outliers Exceed 5% of Obs"
The Smaller Backward Search : AIC model does not pass all six of the diagnostic tests. The Shapiro-Wilk p-value is smaller than the threshold of 0.05. As a result, we reject the null hypothesis that the residuals are normally distributed. As such, this model is not deemed a candidate model and is not passed on to the next evaluation phase: prediction against the test dataset.
### Breusch-Pagan Test on Step BIC Model
bptest(step_bic_model)
##
## studentized Breusch-Pagan test
##
## data: step_bic_model
## BP = 3.0387, df = 7, p-value = 0.8814
plot_fitted_versus_residuals(fitted(step_bic_model), resid(step_bic_model),
"Step Search - BIC Model")
### Shapiro - Wilk Normality Test on Step Search - BIC Model ###
shapiro.test(resid(step_bic_model))
##
## Shapiro-Wilk normality test
##
## data: resid(step_bic_model)
## W = 0.99708, p-value = 0.6604
qqnorm(resid(step_bic_model),
main = "Normal Q-Q Plot, Step Search - BIC Model",
col = "darkgrey")
qqline(resid(step_bic_model), col = "dodgerblue", lwd = 2)
## [1] "Leave One Out Cross-Validated RMSE: 3.04"
### Do the number of standard residuals greater than 2 exceed 5% of the total
### observations -- Step Search: BIC Model
std_resid_step_bic_model <- rstandard(
step_bic_model)[abs(rstandard(step_bic_model)) > 2]
is_std_resid_gt_five_percent__step_bic_model <- length(
std_resid_step_bic_model) / n > 0.05
ifelse(is_std_resid_gt_five_percent__step_bic_model, "Exceeds 5% of Obs",
"Does Not Exceed 5% of Obs")
## [1] "Does Not Exceed 5% of Obs"
### VIF > 5 for Step Search: BIC Model Coefficients
vif_step_bic_model <- vif(step_bic_model)
vif_step_bic_model[which(vif_step_bic_model > 5)]
## OBP R
## 5.221787 5.499516
smaller_step_bic_model <- lm(formula = W ~ SV + R + RA + SHO + CG + X3B +
IPouts + AB + salary, data = bbproj_trn)
summary(smaller_step_bic_model)
##
## Call:
## lm(formula = W ~ SV + R + RA + SHO + CG + X3B + IPouts + AB +
## salary, data = bbproj_trn)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.6930 -2.0156 0.0027 1.8031 8.6432
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.64791877157 19.43171439428 0.857 0.39209
## SV 0.39794921940 0.02548723971 15.614 < 2e-16 ***
## R 0.09465788997 0.00239639476 39.500 < 2e-16 ***
## RA -0.07000030397 0.00295754734 -23.668 < 2e-16 ***
## SHO 0.17217667234 0.05222372960 3.297 0.00106 **
## CG 0.14266098090 0.04629864831 3.081 0.00220 **
## X3B -0.04901240782 0.01691431490 -2.898 0.00396 **
## IPouts 0.02307347246 0.00528945937 4.362 0.0000163 ***
## AB -0.01298730710 0.00293489287 -4.425 0.0000124 ***
## salary 0.00000001035 0.00000000449 2.305 0.02167 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.95 on 410 degrees of freedom
## Multiple R-squared: 0.937, Adjusted R-squared: 0.9356
## F-statistic: 677.2 on 9 and 410 DF, p-value: < 2.2e-16
model_diagnostics(smaller_step_bic_model)
## [1] "Variance Inflation Factors"
## SV R RA SHO CG X3B IPouts AB salary
## 1.610244 1.941649 3.209850 1.895339 1.132668 1.067672 2.091258 2.587975 1.267406
## [1] "Number of coefficients with VIF > 5: 0"
##
## studentized Breusch-Pagan test
##
## data: model
## BP = 4.372, df = 9, p-value = 0.8853
##
##
## Shapiro-Wilk normality test
##
## data: resid(model)
## W = 0.9967, p-value = 0.549
##
## [1] "Leave One Out Cross-Validated RMSE: 2.99"
## [1] "Outliers Do Not Exceed 5% of Obs"
The Smaller Step Search : BIC model passes all six of the diagnostic tests. It is deemed as a candidate model for the next phase: evaluation of the model’s predictive capability against the test dataset.
### Breusch-Pagan Test on Step AIC Model
bptest(step_aic_model)
##
## studentized Breusch-Pagan test
##
## data: step_aic_model
## BP = 22.885, df = 20, p-value = 0.2945
plot_fitted_versus_residuals(fitted(step_aic_model), resid(step_aic_model),
"Step Search - AIC Model")
### Shapiro - Wilk Normality Test on Step Search - AIC Model ###
shapiro.test(resid(step_aic_model))
##
## Shapiro-Wilk normality test
##
## data: resid(step_aic_model)
## W = 0.99683, p-value = 0.5883
qqnorm(resid(step_aic_model),
main = "Normal Q-Q Plot, Step Search - AIC Model",
col = "darkgrey")
qqline(resid(step_aic_model), col = "dodgerblue", lwd = 2)
## [1] "Leave One Out Cross-Validated RMSE: 2.8"
### Do the number of standard residuals greater than 2 exceed 5% of the total
### observations -- Step Search: AIC Model
std_resid_step_aic_model <- rstandard(
step_aic_model)[abs(rstandard(step_aic_model)) > 2]
is_std_resid_gt_five_percent__step_aic_model <- length(
std_resid_step_aic_model) / n > 0.05
ifelse(is_std_resid_gt_five_percent__step_aic_model,
"Exceeds 5% of Obs",
"Does Not Exceed 5% of Obs")
## [1] "Does Not Exceed 5% of Obs"
### VIF > 5 for Step Search: AIC Model Coefficients
vif_step_aic_model <- vif(step_aic_model)
vif_step_aic_model[which(vif_step_aic_model > 5)]
## RA R AB H E FP BPF PPF HA
## 17.262491 5.084637 12.376016 17.104981 50.597130 51.635315 82.305380 80.349149 7.123143
smaller_step_aic_model <- lm(formula = W ~ SV + R + SHO + CG + X3B + IPouts +
BBA + HRA + SOA, data = bbproj_trn)
summary(smaller_step_aic_model)
##
## Call:
## lm(formula = W ~ SV + R + SHO + CG + X3B + IPouts + BBA + HRA +
## SOA, data = bbproj_trn)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.9548 -2.9275 0.2111 2.5960 15.3693
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -106.910065 24.782771 -4.314 0.000020123812728 ***
## SV 0.553970 0.033062 16.756 < 2e-16 ***
## R 0.082172 0.002520 32.612 < 2e-16 ***
## SHO 0.459052 0.068216 6.729 0.000000000057562 ***
## CG 0.295343 0.063584 4.645 0.000004589228066 ***
## X3B -0.063341 0.022615 -2.801 0.00534 **
## IPouts 0.026697 0.005825 4.583 0.000006087870449 ***
## BBA -0.027269 0.003450 -7.905 0.000000000000025 ***
## HRA -0.098460 0.009868 -9.978 < 2e-16 ***
## SOA 0.013936 0.001999 6.970 0.000000000012736 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.006 on 410 degrees of freedom
## Multiple R-squared: 0.8837, Adjusted R-squared: 0.8812
## F-statistic: 346.3 on 9 and 410 DF, p-value: < 2.2e-16
model_diagnostics(smaller_step_aic_model)
## [1] "Variance Inflation Factors"
## SV R SHO CG X3B IPouts BBA HRA SOA
## 1.468999 1.163807 1.753261 1.158178 1.034795 1.375007 1.345455 1.577315 1.440348
## [1] "Number of coefficients with VIF > 5: 0"
##
## studentized Breusch-Pagan test
##
## data: model
## BP = 3.5705, df = 9, p-value = 0.9373
##
##
## Shapiro-Wilk normality test
##
## data: resid(model)
## W = 0.99481, p-value = 0.1694
##
## [1] "Leave One Out Cross-Validated RMSE: 4.05"
## [1] "Outliers Do Not Exceed 5% of Obs"
The Smaller Step Search : AIC model passes all six of the diagnostic tests. It is deemed as a candidate model for the next phase: evaluation of the model’s predictive capability against the test dataset.
### Breusch-Pagan Test
bptest(bic_forward_model)
##
## studentized Breusch-Pagan test
##
## data: bic_forward_model
## BP = 3.0387, df = 7, p-value = 0.8814
plot_fitted_versus_residuals(fitted(bic_forward_model), resid(bic_forward_model), "Forward Search - BIC Model")
### Shapiro - Wilk Normality Test ###
shapiro.test(resid(bic_forward_model))
##
## Shapiro-Wilk normality test
##
## data: resid(bic_forward_model)
## W = 0.99708, p-value = 0.6604
qqnorm(resid(bic_forward_model), main = "Normal Q-Q Plot, Step Search - BIC Model", col = "darkgrey")
qqline(resid(bic_forward_model), col = "dodgerblue", lwd = 2)
## [1] "Leave One Out Cross-Validated RMSE: 3.04"
### Do the number of standard residuals greater than 2 exceed 5% of the total observations -- Step Search: AIC Model
std_resid <- rstandard(bic_forward_model)[abs(rstandard(bic_forward_model)) > 2]
is_std_resid_gt_five_percent <- length(std_resid) / n > 0.05
ifelse(is_std_resid_gt_five_percent,"Exceeds 5% of Obs", "Does Not Exceed 5% of Obs")
## [1] "Does Not Exceed 5% of Obs"
### VIF > 5 for Step Search: AIC Model Coefficients
library(faraway)
vifs <- vif(bic_forward_model)
vifs[which(vifs > 5)]
## OBP R
## 5.221787 5.499516
library(caret)
high_vif_cols <- as.array(names(vifs[which(vifs > 5)]))
indices_to_drop <- findCorrelation(cor(bbproj_trn[,c(high_vif_cols)]), cutoff = 0.6)
vars_to_drop <- high_vif_cols[indices_to_drop]
vars_to_drop
## [1] "R"
terms = attr(bic_forward_model$terms, "term.labels")
terms = terms[! terms %in% vars_to_drop]
formula = as.formula(paste("W ~", paste(terms, collapse = "+")))
smaller_model <- lm(formula = formula, data = bbproj_trn)
summary(smaller_model)
##
## Call:
## lm(formula = formula, data = bbproj_trn)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.2187 -2.7593 -0.0139 2.6440 11.9379
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -48.379453 5.538267 -8.735 < 2e-16 ***
## SV 0.510814 0.034581 14.771 < 2e-16 ***
## RA -0.067633 0.003559 -19.005 < 2e-16 ***
## CG 0.259678 0.063826 4.069 0.0000567 ***
## OBP 478.519624 15.245069 31.388 < 2e-16 ***
## X3B -0.058658 0.023336 -2.514 0.0123 *
## SHO 0.166892 0.073487 2.271 0.0237 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.162 on 413 degrees of freedom
## Multiple R-squared: 0.8736, Adjusted R-squared: 0.8717
## F-statistic: 475.7 on 6 and 413 DF, p-value: < 2.2e-16
smaller_bic_forward_model = smaller_model
model_diagnostics(smaller_bic_forward_model)
## [1] "Variance Inflation Factors"
## SV RA CG OBP X3B SHO
## 1.488879 2.334176 1.081155 1.107182 1.020757 1.884954
## [1] "Number of coefficients with VIF > 5: 0"
##
## studentized Breusch-Pagan test
##
## data: model
## BP = 3.5489, df = 6, p-value = 0.7375
##
##
## Shapiro-Wilk normality test
##
## data: resid(model)
## W = 0.99807, p-value = 0.9205
##
## [1] "Leave One Out Cross-Validated RMSE: 4.2"
## [1] "Outliers Do Not Exceed 5% of Obs"
The Smaller Forward Search : BIC model passes all six of the diagnostic tests. It is deemed as a candidate model for the next phase: evaluation of the model’s predictive capability against the test dataset.
### Breusch-Pagan Test
bptest(aic_forward_model)
##
## studentized Breusch-Pagan test
##
## data: aic_forward_model
## BP = 22.885, df = 20, p-value = 0.2945
plot_fitted_versus_residuals(fitted(aic_forward_model), resid(aic_forward_model), "Forward Search - AIC Model")
### Shapiro - Wilk Normality Test ###
shapiro.test(resid(aic_forward_model))
##
## Shapiro-Wilk normality test
##
## data: resid(aic_forward_model)
## W = 0.99683, p-value = 0.5883
qqnorm(resid(aic_forward_model), main = "Normal Q-Q Plot, Step Search - AIC Model", col = "darkgrey")
qqline(resid(aic_forward_model), col = "dodgerblue", lwd = 2)
## [1] "Leave One Out Cross-Validated RMSE: 2.8"
### Do the number of standard residuals greater than 2 exceed 5% of the total observations -- Step Search: AIC Model
std_resid <- rstandard(aic_forward_model)[abs(rstandard(aic_forward_model)) > 2]
is_std_resid_gt_five_percent <- length(std_resid) / n > 0.05
ifelse(is_std_resid_gt_five_percent,"Exceeds 5% of Obs", "Does Not Exceed 5% of Obs")
## [1] "Does Not Exceed 5% of Obs"
### VIF > 5 for Step Search: AIC Model Coefficients
library(faraway)
vifs <- vif(aic_forward_model)
vifs[which(vifs > 5)]
## RA R AB H E FP BPF PPF HA
## 17.262491 5.084637 12.376016 17.104981 50.597130 51.635315 82.305380 80.349149 7.123143
library(caret)
high_vif_cols <- as.array(names(vifs[which(vifs > 5)]))
indices_to_drop <- findCorrelation(cor(bbproj_trn[,c(high_vif_cols)]), cutoff = 0.6)
vars_to_drop <- high_vif_cols[indices_to_drop]
vars_to_drop
## [1] "H" "RA" "R" "BPF" "FP"
terms = attr(aic_forward_model$terms, "term.labels")
terms = terms[! terms %in% vars_to_drop]
formula = as.formula(paste("W ~", paste(terms, collapse = "+")))
smaller_model <- lm(formula = formula, data = bbproj_trn)
summary(smaller_model)
##
## Call:
## lm(formula = formula, data = bbproj_trn)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.7707 -4.0132 0.0735 3.9381 16.8375
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -130.372691160134 40.942184890385 -3.184 0.001563 **
## SV 0.691822762359 0.051293290999 13.488 < 2e-16 ***
## CG 0.451451517680 0.097869852286 4.613 0.00000534151951 ***
## X3B -0.044130287173 0.037206097668 -1.186 0.236278
## SHO 0.187051414650 0.110061882108 1.700 0.089993 .
## IPouts 0.006637885812 0.010439115229 0.636 0.525223
## AB 0.035276686983 0.005368182209 6.571 0.00000000015396 ***
## CS -0.018371247437 0.029837527171 -0.616 0.538433
## GIDP 0.026472866957 0.020953425350 1.263 0.207169
## SF 0.273938786852 0.038676525292 7.083 0.00000000000631 ***
## BBA -0.020850806318 0.005466335198 -3.814 0.000158 ***
## HRA -0.033796874832 0.016267982447 -2.078 0.038386 *
## E -0.045858530274 0.020678376355 -2.218 0.027131 *
## PPF 0.154768090993 0.063792547637 2.426 0.015698 *
## salary 0.000000022080 0.000000009774 2.259 0.024405 *
## HA -0.037682947274 0.005416312830 -6.957 0.00000000001406 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.152 on 404 degrees of freedom
## Multiple R-squared: 0.7299, Adjusted R-squared: 0.7198
## F-statistic: 72.77 on 15 and 404 DF, p-value: < 2.2e-16
smaller_aic_forward_model = smaller_model
model_diagnostics(smaller_aic_forward_model)
## [1] "Variance Inflation Factors"
## SV CG X3B SHO IPouts AB CS GIDP SF BBA HRA E PPF salary
## 1.499417 1.163642 1.187716 1.935442 1.872698 1.990606 1.231270 1.178478 1.242515 1.432725 1.817827 1.305924 1.197612 1.380530
## HA
## 2.454893
## [1] "Number of coefficients with VIF > 5: 0"
##
## studentized Breusch-Pagan test
##
## data: model
## BP = 20.617, df = 15, p-value = 0.1495
##
##
## Shapiro-Wilk normality test
##
## data: resid(model)
## W = 0.9972, p-value = 0.6962
##
## [1] "Leave One Out Cross-Validated RMSE: 6.27"
## [1] "Outliers Do Not Exceed 5% of Obs"
The Smaller Step Search : AIC model passes all six of the diagnostic tests. It is deemed as a candidate model for the next phase: evaluation of the model’s predictive capability against the test dataset.
We now move forward with five candidate models and predict MLB regular season team wins with our test dataset from the 2014, 2015 and 2016 MLB regular seassons. (There are a total of 90 observations in the test dataset – one observation for each team for 2014, 2015 and 2016.)
While all models do a good job predicting the number of regular season wins by a MLB team, the Smaller Step Search: BIC Model produces the best results against the test dataset. This model has the lowest Average Percent Error score and Test RMSE score basis the test dataset. This model also generalizes the best against unseen observations. This means there is a high likelihood this model will perform well against test data for the 2017 and 2018 MLB seasons. This model also contains the second smallest number of predictors; and it also contains Salary as one of the predictors. This means it is easy to explain the relationship between the response variable (W) and it predictors. The fact that Salary is included as a predictor supports the original premise that Salary while not a dominant predictor is a marginally sigificant predictor of regular season MLB team wins.
The following table summarizes the key evaluation factors used to choose the best model from the five candidates.
| Predictor Count | Salary Included | LOOCV_RMSE | AVG % Error | Test RMSE | |
|---|---|---|---|---|---|
| Smaller Backward Search: BIC Model | 10 | FALSE | 4.95 | 5.51 | 0.47 |
| Smaller Step Search: BIC Model | 9 | TRUE | 2.99 | 3.17 | 0.32 |
| Smaller Step Search: AIC Model | 9 | FALSE | 4.05 | 4.07 | 0.97 |
| Smaller Forward Search: BIC Model | 6 | FALSE | 4.20 | 5.46 | 1.55 |
| Smaller Forward Search AIC Model | 15 | TRUE | 6.27 | 6.27 | 3.25 |
The winning model – Smaller Step Search: BIC – adds validity to the axiom that pitching wins games The key predictors that influence the number of regular season wins by an MLB team are related to pitching.
It is moderately surprising that offense related predictors do not dominate the model. (One of the authors of this study is a die-hard New York Yankees fan. The New York Yankees are well known for slightly above average pitching but dominant offense. The Yankees have 27 World Championships to date. That said, the last championship occured nine years ago. So maybe there is more to good pitching that meets the eye.) The one offense related predictor that mmoves the needle is Runs. Holding all other predictors constant, an increase in 1 run per game increases the average number of wins by 0.10.
The most exciting play in baseball, the Triple, is negatively correlated with wins. Holding all other predictors constant, an increase of 1 Triple reduces the average number of wins by 0.5. This is a real “head-scratcher”.
Let’s turn our attention to logistic regression and our ability to classify and predict division winners using our team predictor set. First, let’s see how well salary alone can predict pennants.
salary_only_model <- glm(DivWin ~ salary, data = bbproj_trn, family = binomial)
(salary_only_summary = summary(salary_only_model))
##
## Call:
## glm(formula = DivWin ~ salary, family = binomial, data = bbproj_trn)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5538 -0.6746 -0.5664 -0.4673 2.1545
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.693550028350 0.317777033524 -8.476 < 2e-16 ***
## salary 0.000000015283 0.000000003249 4.704 0.00000255 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 420.34 on 419 degrees of freedom
## Residual deviance: 397.38 on 418 degrees of freedom
## AIC: 401.38
##
## Number of Fisher Scoring iterations: 4
plot(as.numeric(DivWin) - 1 ~ salary, data = bbproj_trn,
pch = 20,
main = "Probability of Buying a Division Winner",
ylab = "Probability of Winning Division",
xlab = "Team Salary ($)",
xlim = c(0, 3e8))
curve(predict(salary_only_model, data.frame(salary = x), type = "response"),
add = TRUE,
col = "tomato",
lty = 2,
lwd = 2)
Graphically, a team’s payroll seems to be moderately influential in predicting their chances of taking home a division crown. Indeed, our Wald test for salary alone yields a p-value of 0.0000025, allowing us to reject the null hypothesis (\(H_0 : \beta_{salary} = 0\)) for any reasonable value for \(\alpha\). So what happens when we look at our salary model’s misclassification rate?
salary_prediction <- ifelse(predict(salary_only_model,
bbproj_tst,
type = "response") > 0.5, "Y", "N")
(prevalence = table(bbproj_tst$DivWin) / nrow(bbproj_tst))
##
## N Y
## 0.8 0.2
(salary_misclass = mean(salary_prediction != bbproj_tst$DivWin))
## [1] 0.2222222
First, let’s examine prevalence of division winners. We see that 20% of the teams were division winners, which makes sense, since there are 6 divisions and 30 MLB teams, therefore you will only have 6 division winners per year. Our salary model has a misclassification rate of 0.222, which is worse than our prevalence. We would have a better misclassification rate if we simply stated that there are no division winners! We can certainly do better than this.
We will begin our search for a better classifier by setting up an initial logistic regression model contain all of the predictors we would like to evaluate. Then, we will proceed to eliminate predictors using backwards, forwards, and stepwise AIC and BIC searches. See Appendix C for a complete evaluation of all of the models. The following will detail the methodology using our best candidate model.
scope <- DivWin ~ W + R + AB + H + X2B + X3B + HR + BB + SO + SB + CS + HBP +
SF + RA + ER + ERA + CG + SHO + SV + IPouts + HA + HRA + BBA + SOA + E +
DP + FP + attendance + BPF + PPF + salary + RBI + GIDP + IBB + TB + SLG +
OBP + OPS + WHIP + BABIP + RC + X1B + uBB + wOBA
start_model <- glm(DivWin ~ 1, data = bbproj_trn, family = binomial)
n <- length(resid(start_model))
bic_model <- step(start_model, direction = "forward", scope = scope,
k = log(n), trace = 0)
summary(bic_model)
##
## Call:
## glm(formula = DivWin ~ W + X2B, family = binomial, data = bbproj_trn)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.35987 -0.30139 -0.06081 -0.00867 2.59348
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -27.523331 3.832156 -7.182 0.000000000000686 ***
## W 0.350030 0.042623 8.212 < 2e-16 ***
## X2B -0.016344 0.006758 -2.418 0.0156 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 420.34 on 419 degrees of freedom
## Residual deviance: 194.55 on 417 degrees of freedom
## AIC: 200.55
##
## Number of Fisher Scoring iterations: 7
Our forward search using BIC yields the following logistic model
\[ \log \bigg(\frac{P[DivWin = 1]}{1 - P[DivWin = 1]} \bigg) = \beta_0 + \beta_W W + \beta_{X2B} X2B \]
where the log odds of a team winning their division is dependent upon wins and doubles.
Next, we will create a function that will evaluate this model using a confusion matrix and calculate its sensitivity, specificity, and its misclassification rate (the classify_and_diagnose function code can be found in Appendix C). Then we will start by examining a default cutoff value of 0.5.
(bic_diag <- classify_and_diagnose(bic_model))
## $confusion_matrix
## actual
## prediction N Y
## N 70 7
## Y 2 11
##
## $sensitivity
## [1] 0.6111111
##
## $specificity
## [1] 0.9722222
##
## $misclassification
## [1] 0.1
Here, we beat the salary only model along with the prevalence with a misclassification rate of 0.1. Our specificity looks great for our forward BIC model at 0.972. Our 7 false negatives could come down, however, with an adjustment to our cutoff value.
Next we will define a function that will loop through a vector of potential cutoffs to isolate one that will produce the smallest misclassification rate with a minimal differential between sensitivity and specificity (see Appendix C for the opt_logistic_cutoff function’s definition).
(opt_cutoff = opt_logistic_cutoff(bic_model, cut_start = 0.1, cut_end = 0.8))
## cutoff misclass sensitivity specificity
## 0.30000000 0.05555556 0.88888889 0.95833333
classify_and_diagnose(bic_model, cutoff = opt_cutoff["cutoff"])$confusion_matrix
## actual
## prediction N Y
## N 69 2
## Y 3 16
Our routine finds an optimal cutoff value for our forward BIC model of 0.3. Our misclassification rate went down with our cutoff adjustment to 0.056. Additionally, our sensitivity went up significantly (0.889) and we only had a modest reduction in specificity (0.958). This model looks to be exceptionally adept at classifying division winners, and salary is not even a predictor!
library(knitr)
kable(logistic_results, col.names = logistic_col_names,
caption = "Logistic Regression Model Result Summary")
| Cutoff | Misclassification Rate | Sensitivity | Specificity | p | False Negatives | False Positives | |
|---|---|---|---|---|---|---|---|
| Forward AIC | 0.50 | 0.1111111 | 0.5555556 | 0.9722222 | 6 | 8 | 2 |
| Backward AIC | 0.50 | 0.0777778 | 0.7222222 | 0.9722222 | 21 | 5 | 2 |
| Stepwise AIC | 0.50 | 0.1111111 | 0.5555556 | 0.9722222 | 6 | 8 | 2 |
| Forward BIC | 0.50 | 0.1000000 | 0.6111111 | 0.9722222 | 3 | 7 | 2 |
| Backward BIC | 0.50 | 0.1000000 | 0.6111111 | 0.9722222 | 11 | 7 | 2 |
| Stepwise BIC | 0.50 | 0.1000000 | 0.6111111 | 0.9722222 | 3 | 7 | 2 |
| Forward AIC (Optimal Cutoff) | 0.23 | 0.0777778 | 0.9444444 | 0.9166667 | 6 | 1 | 6 |
| Backward AIC (Optimal Cutoff) | 0.26 | 0.0555556 | 0.8888889 | 0.9583333 | 21 | 2 | 3 |
| Stepwise AIC (Optimal Cutoff) | 0.23 | 0.0777778 | 0.9444444 | 0.9166667 | 6 | 1 | 6 |
| Forward BIC (Optimal Cutoff) | 0.30 | 0.0555556 | 0.8888889 | 0.9583333 | 3 | 2 | 3 |
| Backward BIC (Optimal Cutoff) | 0.26 | 0.0555556 | 0.8333333 | 0.9722222 | 11 | 3 | 2 |
| Stepwise BIC (Optimal Cutoff) | 0.30 | 0.0555556 | 0.8888889 | 0.9583333 | 3 | 2 | 3 |
Forward-search BIC model with an optimal cutoff of 0.3.
\[ \log \bigg(\frac{P[DivWin = 1]}{1 - P[DivWin = 1]} \bigg) = \beta_0 + \beta_W W + \beta_{X2B} X2B \]
We choose the forward-search BIC model with a cutoff value of 0.3 as our best predictive model. While there are a few models that have an equivalent minimal misclassification rate, the forward BIC model has the least number of predictors and therefore the easiest to interpret. Additionally, the sensitivity and specificity (false negatives and false positives, respectively) are minimized and nearly equivalent.
Our winning model tells us that a team’s win total W and double tally X2B has a significant relationship with that team’s probability of winning their division. The relationship with W is fairly self evident and quite boring given the definition of a division winner - the team with the most wins in their division at the end of the season is declared the division champ. Interestingly, however, there is a somewhat-significant negative relationship with doubles. That is, given two teams with the same number of wins, the team with the fewer number of doubles would have a higher probability of winning their division. What does this mean? Well, one explanation is, given all else equal, it’s possible the game favors teams proficient at small ball - being able to manufacture runs with walks, bunts, steals, and sacrifice flies - rather than a team’s capacity of lacing ropes to the alley in left-center.
classify_and_diagnose = function(model, data = bbproj_tst,
actual = bbproj_tst$DivWin,
pos = "Y", neg = "N", cutoff = 0.5) {
# Generate a classifier given the model, data, cutoff, and positive and
# negative responses
pred = ifelse(predict(model,
data,
type = "response") > cutoff, pos, neg)
# Generate the confusion matrix
conf_mat = table(prediction = pred, actual = actual)
# Calculate sensitivity, specificity, and the misclassification rate and
# return them plus the confusion matrix
list(confusion_matrix = conf_mat,
sensitivity = conf_mat[2, 2] / sum(conf_mat[, 2]),
specificity = conf_mat[1, 1] / sum(conf_mat[, 1]),
misclassification = mean(pred != actual))
}
opt_logistic_cutoff = function(model, cut_start = 0.01, cut_end = 0.99,
data = bbproj_tst, actual = bbproj_tst$DivWin,
pos = "Y", neg = "N", plotit = TRUE) {
# Loop through potential cutoffs from cut_start to cut_end to determine a
# cutoff that produces the lowest misclassification rate with the smallest
# delta between sensitivity and specificity
cutoffs = seq(cut_start, cut_end, by = 0.01)
sens = rep(0, length(cutoffs))
spec = rep(0, length(cutoffs))
misclass = rep(0, length(cutoffs))
delta = rep(0, length(cutoffs))
for (i in 1:length(cutoffs)) {
diagnostics = classify_and_diagnose(model,
data = data,
actual = actual,
pos = pos,
neg = neg,
cutoff = cutoffs[i])
sens[i] = diagnostics$sensitivity
spec[i] = diagnostics$specificity
misclass[i] = diagnostics$misclassification
delta[i] = abs(diagnostics$sensitivity - diagnostics$specificity)
}
# Get the indicies of the smallest misclassification rates
min_misclass = which(misclass == min(misclass))
# Get the smallest delta between sensitivity and specificity at the
# identified misclassification indicies
min_delta = min_misclass[which.min(delta[min_misclass])]
# Plot sensitivity, specificity, and misclassification if requested
if (plotit) {
plot(sens ~ cutoffs,
xlab = "Cutoff",
ylab = "Sensitivity/Specificity",
main = "Sensitivity and Specificity at Varied Cutoffs",
col = "tomato",
type = "b",
ylim = c(0, 1),
pch = 20)
lines(spec ~ cutoffs,
col = "darkslategray4",
type = "b",
pch = 20)
lines(misclass ~ cutoffs,
col = "darkorange",
type = "b",
pch = 20)
abline(v = cutoffs[min_delta], lty = 2)
legend("topright", c("Sensitivity",
"Specificity",
"Misclassification"),
col = c("tomato", "darkslategray4", "darkorange"),
lwd = 1,
pch = 20)
}
# Return the misclassification, sensitivity, and specificity at the optimal
# cutoff value
c(cutoff = cutoffs[min_delta],
misclass = misclass[min_delta],
sensitivity = sens[min_delta],
specificity = spec[min_delta])
}
formula <- formula(scope)
back_bic_model <- step(glm(formula, data = bbproj_trn, family = binomial),
direction = "backward", k = log(n), trace = 0)
back_bic_diag <- classify_and_diagnose(back_bic_model)
summary(back_bic_model)
##
## Call:
## glm(formula = DivWin ~ W + AB + H + HR + BB + HBP + IBB + SLG +
## OBP + wOBA, family = binomial, data = bbproj_trn)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.09816 -0.27642 -0.04275 -0.00332 2.76522
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1052.50217 319.70312 -3.292 0.000994 ***
## W 0.37031 0.04793 7.726 0.0000000000000111 ***
## AB 0.17509 0.05585 3.135 0.001717 **
## H -0.52892 0.16383 -3.228 0.001245 **
## HR 0.09433 0.03603 2.618 0.008842 **
## BB -0.34709 0.10338 -3.357 0.000787 ***
## HBP -0.33911 0.09686 -3.501 0.000463 ***
## IBB -0.30746 0.12140 -2.533 0.011320 *
## SLG 1794.92000 702.65755 2.554 0.010635 *
## OBP 6178.22128 2030.78091 3.042 0.002348 **
## wOBA -5414.02198 2077.07626 -2.607 0.009146 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 420.34 on 419 degrees of freedom
## Residual deviance: 179.06 on 409 degrees of freedom
## AIC: 201.06
##
## Number of Fisher Scoring iterations: 8
back_bic_diag
## $confusion_matrix
## actual
## prediction N Y
## N 70 7
## Y 2 11
##
## $sensitivity
## [1] 0.6111111
##
## $specificity
## [1] 0.9722222
##
## $misclassification
## [1] 0.1
opt_logistic_cutoff(back_bic_model, cut_start = 0.1, cut_end = 0.8)
## cutoff misclass sensitivity specificity
## 0.26000000 0.05555556 0.83333333 0.97222222
opt_back_bic_diag$confusion_matrix
## actual
## prediction N Y
## N 70 3
## Y 2 15
forw_bic_model <- step(start_model, direction = "forward", scope = scope,
k = log(n), trace = 0)
forw_bic_diag <- classify_and_diagnose(forw_bic_model)
summary(forw_bic_model)
##
## Call:
## glm(formula = DivWin ~ W + X2B, family = binomial, data = bbproj_trn)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.35987 -0.30139 -0.06081 -0.00867 2.59348
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -27.523331 3.832156 -7.182 0.000000000000686 ***
## W 0.350030 0.042623 8.212 < 2e-16 ***
## X2B -0.016344 0.006758 -2.418 0.0156 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 420.34 on 419 degrees of freedom
## Residual deviance: 194.55 on 417 degrees of freedom
## AIC: 200.55
##
## Number of Fisher Scoring iterations: 7
forw_bic_diag
## $confusion_matrix
## actual
## prediction N Y
## N 70 7
## Y 2 11
##
## $sensitivity
## [1] 0.6111111
##
## $specificity
## [1] 0.9722222
##
## $misclassification
## [1] 0.1
opt_logistic_cutoff(forw_bic_model, cut_start = 0.1, cut_end = 0.8)
## cutoff misclass sensitivity specificity
## 0.30000000 0.05555556 0.88888889 0.95833333
opt_forw_bic_diag$confusion_matrix
## actual
## prediction N Y
## N 69 2
## Y 3 16
step_bic_model <- step(start_model, direction = "both", scope = scope,
k = log(n), trace = 0)
step_bic_diag <- classify_and_diagnose(step_bic_model)
summary(step_bic_model)
##
## Call:
## glm(formula = DivWin ~ W + X2B, family = binomial, data = bbproj_trn)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.35987 -0.30139 -0.06081 -0.00867 2.59348
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -27.523331 3.832156 -7.182 0.000000000000686 ***
## W 0.350030 0.042623 8.212 < 2e-16 ***
## X2B -0.016344 0.006758 -2.418 0.0156 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 420.34 on 419 degrees of freedom
## Residual deviance: 194.55 on 417 degrees of freedom
## AIC: 200.55
##
## Number of Fisher Scoring iterations: 7
step_bic_diag
## $confusion_matrix
## actual
## prediction N Y
## N 70 7
## Y 2 11
##
## $sensitivity
## [1] 0.6111111
##
## $specificity
## [1] 0.9722222
##
## $misclassification
## [1] 0.1
opt_step_bic <- opt_logistic_cutoff(step_bic_model,
cut_start = 0.1,
cut_end = 0.8,
plotit = FALSE)
opt_step_bic_diag <- classify_and_diagnose(step_bic_model,
cutoff = opt_step_bic["cutoff"])
opt_logistic_cutoff(step_bic_model, cut_start = 0.1, cut_end = 0.8)
## cutoff misclass sensitivity specificity
## 0.30000000 0.05555556 0.88888889 0.95833333
opt_step_bic_diag$confusion_matrix
## actual
## prediction N Y
## N 69 2
## Y 3 16
formula <- formula(scope)
back_aic_model <- step(glm(formula, data = bbproj_trn, family = binomial),
direction = "backward", trace = 0)
back_aic_diag <- classify_and_diagnose(back_aic_model)
summary(back_aic_model)
##
## Call:
## glm(formula = DivWin ~ W + R + AB + H + HR + BB + CS + HBP +
## ER + IPouts + E + FP + BPF + PPF + RBI + IBB + SLG + OBP +
## OPS + wOBA, family = binomial, data = bbproj_trn)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.96822 -0.20843 -0.02111 -0.00065 2.65771
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -140.179255 693.655494 -0.202 0.839848
## W 0.545656 0.090798 6.010 0.00000000186 ***
## R -0.076857 0.035782 -2.148 0.031717 *
## AB 0.183520 0.063041 2.911 0.003601 **
## H -0.592491 0.183941 -3.221 0.001277 **
## HR 0.079762 0.041066 1.942 0.052099 .
## BB -0.401042 0.116424 -3.445 0.000572 ***
## CS -0.059004 0.023336 -2.528 0.011457 *
## HBP -0.393193 0.110648 -3.554 0.000380 ***
## ER 0.024676 0.008386 2.943 0.003255 **
## IPouts 0.022175 0.011255 1.970 0.048806 *
## E -0.168360 0.092043 -1.829 0.067378 .
## FP -1111.221700 583.517366 -1.904 0.056865 .
## BPF 0.736250 0.339446 2.169 0.030084 *
## PPF -0.674039 0.343047 -1.965 0.049431 *
## RBI 0.070973 0.036748 1.931 0.053438 .
## IBB -0.299451 0.134303 -2.230 0.025769 *
## SLG 851959707.701375 417793051.690715 2.039 0.041431 *
## OBP 851964540.159953 417793092.777287 2.039 0.041430 *
## OPS -851957944.218280 417793035.107366 -2.039 0.041431 *
## wOBA -5332.125383 2319.746099 -2.299 0.021529 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 420.34 on 419 degrees of freedom
## Residual deviance: 151.22 on 399 degrees of freedom
## AIC: 193.22
##
## Number of Fisher Scoring iterations: 8
back_aic_diag <- classify_and_diagnose(back_aic_model)
opt_back_aic <- opt_logistic_cutoff(back_aic_model,
cut_start = 0.1,
cut_end = 0.8,
plotit = FALSE)
opt_back_aic_diag <- classify_and_diagnose(back_aic_model,
cutoff = opt_back_aic["cutoff"])
opt_logistic_cutoff(back_aic_model, cut_start = 0.1, cut_end = 0.8)
## cutoff misclass sensitivity specificity
## 0.26000000 0.05555556 0.88888889 0.95833333
opt_back_aic_diag$confusion_matrix
## actual
## prediction N Y
## N 69 2
## Y 3 16
forw_aic_model <- step(start_model, direction = "forward",
scope = scope, trace = 0)
forw_aic_diag <- classify_and_diagnose(forw_aic_model)
summary(forw_aic_model)
##
## Call:
## glm(formula = DivWin ~ W + X2B + HA + DP + GIDP, family = binomial,
## data = bbproj_trn)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.31485 -0.25596 -0.04893 -0.00565 2.90172
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -37.894889 6.186389 -6.126 9.04e-10 ***
## W 0.392678 0.048963 8.020 1.06e-15 ***
## X2B -0.024866 0.007752 -3.208 0.00134 **
## HA 0.006988 0.002855 2.447 0.01440 *
## DP -0.024295 0.012063 -2.014 0.04401 *
## GIDP 0.021012 0.012242 1.716 0.08609 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 420.34 on 419 degrees of freedom
## Residual deviance: 182.92 on 414 degrees of freedom
## AIC: 194.92
##
## Number of Fisher Scoring iterations: 7
forw_aic_diag
## $confusion_matrix
## actual
## prediction N Y
## N 70 8
## Y 2 10
##
## $sensitivity
## [1] 0.5555556
##
## $specificity
## [1] 0.9722222
##
## $misclassification
## [1] 0.1111111
opt_forw_aic <- opt_logistic_cutoff(forw_aic_model,
cut_start = 0.1,
cut_end = 0.8,
plotit = FALSE)
opt_forw_aic_diag <- classify_and_diagnose(forw_aic_model,
cutoff = opt_forw_aic["cutoff"])
opt_logistic_cutoff(forw_aic_model, cut_start = 0.1, cut_end = 0.8)
## cutoff misclass sensitivity specificity
## 0.23000000 0.07777778 0.94444444 0.91666667
opt_forw_aic_diag$confusion_matrix
## actual
## prediction N Y
## N 66 1
## Y 6 17
step_aic_model <- step(start_model, direction = "both",
scope = scope, trace = 0)
step_aic_diag <- classify_and_diagnose(step_aic_model)
summary(step_aic_model)
##
## Call:
## glm(formula = DivWin ~ W + X2B + HA + DP + GIDP, family = binomial,
## data = bbproj_trn)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.31485 -0.25596 -0.04893 -0.00565 2.90172
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -37.894889 6.186389 -6.126 9.04e-10 ***
## W 0.392678 0.048963 8.020 1.06e-15 ***
## X2B -0.024866 0.007752 -3.208 0.00134 **
## HA 0.006988 0.002855 2.447 0.01440 *
## DP -0.024295 0.012063 -2.014 0.04401 *
## GIDP 0.021012 0.012242 1.716 0.08609 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 420.34 on 419 degrees of freedom
## Residual deviance: 182.92 on 414 degrees of freedom
## AIC: 194.92
##
## Number of Fisher Scoring iterations: 7
step_aic_diag
## $confusion_matrix
## actual
## prediction N Y
## N 70 8
## Y 2 10
##
## $sensitivity
## [1] 0.5555556
##
## $specificity
## [1] 0.9722222
##
## $misclassification
## [1] 0.1111111
opt_step_aic <- opt_logistic_cutoff(step_aic_model,
cut_start = 0.1,
cut_end = 0.8,
plotit = FALSE)
opt_step_aic_diag <- classify_and_diagnose(step_aic_model,
cutoff = opt_step_aic["cutoff"])
opt_logistic_cutoff(step_aic_model, cut_start = 0.1, cut_end = 0.8)
## cutoff misclass sensitivity specificity
## 0.23000000 0.07777778 0.94444444 0.91666667
opt_step_aic_diag$confusion_matrix
## actual
## prediction N Y
## N 66 1
## Y 6 17